
fix: close stale relay connection on ErrConnAlreadyExists to recover …#5866

Open
fpenezic wants to merge 6 commits into netbirdio:main from fpenezic:fix/stale-relay-conn-recovery

Conversation

@fpenezic fpenezic commented Apr 12, 2026

Describe your changes

Fix peer reconnection loops after a network event (PPPoE reconnect, NAT/conntrack flush, IP rotation) leaves transport state stale on either the relay or ICE path. Both manifest as the WireGuard tunnel going silent while the client UI reports Connected, with the connection never recovering on its own.

This PR addresses two distinct but related failure modes observed in production on a peer behind PPPoE NAT (Raspberry Pi) talking to a peer with a public IP (Oracle Cloud).

Problem 1 - Relay path: stale ErrConnAlreadyExists reuse

When the peer is running over relay and a network event invalidates the existing relay session, WorkerRelay.OnNewOffer calls relayManager.OpenConn which returns ErrConnAlreadyExists because the relay client still holds a map entry for the peer key. The previous behavior was to silently bail out and reuse that entry - which is dead.

Fix (commits 1–4, 6):

  • On ErrConnAlreadyExists, close the existing relay conn and reopen it.
  • Route CloseConnByPeerKey to the correct relay client (home vs. foreign server).
  • Take rt.RLock() when reading RelayTrack.relayClient to avoid a race with openConnVia() which initializes that field under rt.Lock().
  • Gate the close-and-retry behind a relayConnStale atomic.Bool flag so we only tear down when something has signaled that the entry is no longer backed by a live peer session. Without this gate, rapid successive offers from the remote peer cause an infinite tear-down/rebuild loop (observed: 3608 cycles in ~1 hour, ~9 Mbit/s of constant traffic).
  • Signal sources for the stale flag: conn.onWGDisconnected (Relay path), WorkerRelay.CloseConn, WorkerRelay.onRelayClientDisconnected.

Problem 2 - ICE path: session ID not rotated on WG timeout

When WireGuard handshake times out while running over ICE, onWGDisconnected calls workerICE.Close(). Close() sets w.agent = nil synchronously before pion's ICE library fires the asynchronous ConnectionStateClosed event. By the time onConnectionStateChange runs closeAgent(), the w.agent == agent guard fails (w.agent is already nil) and the session ID is not rotated.

Without rotation, the next ICE offer carries the same local session ID. The remote peer in OnNewOffer compares remoteSessionID against the incoming offer's SessionID, finds them equal, and skips agent recreation - reusing the existing agent and its stale candidates from the broken network state.

Observed in production: 30s reconnect loop with the same offer session ID logged 70+ times in a row, "ICE connection succeeded" on every cycle, WG handshake never recovering. UI shows Connected, P2P while last successful WG handshake is 40+ minutes old.

Fix (commit 5): Rotate the session ID explicitly in onWGDisconnected for the ICE case (mirroring the existing Relay-path behavior we added in commit 1) so the remote peer always recreates its ICE agent after a WG timeout on ICE.

Test plan

  • Local build passes (go build ./...)
  • Deployed on two production peers (RPi behind PPPoE NAT, OCI public IP) for live testing
  • Verified stable P2P connection with WG handshake refreshing on the normal 3-minute cadence (no regression)
  • Verified no infinite tear-down loop on relay-active peer (the bug introduced by the v1 fix and resolved in v2)
  • Validated against multiple real PPPoE reconnects with public IP rotation: tunnel recovered automatically in a single ICE → Relay → ICE transition (~4 seconds), no loop, WG handshake refreshed normally on the 3-minute cadence afterwards. Stable for hours post-event.

Caveat: unrelated Rosenpass interaction

During testing we hit a separate failure mode where Rosenpass (post-quantum PSK) was enabled in permissive mode on both peers. After an IP change, the WG tunnel could not recover because the local PSK state diverged from the remote peer's PSK (one side had a Rosenpass-managed PSK, the other had none). This produced the same external symptom (silent tunnel, UI says Connected) but is not addressed by this PR - the fixes here only cover the relay/ICE transport state. With Rosenpass disabled, both fixes recovered the tunnel cleanly across multiple PPPoE cycles. Filing the Rosenpass interaction as a separate issue.

Notes

The two problems share the same trigger (network event invalidates per-peer state without invalidating the local control plane) but live in different transports, so the fixes are orthogonal. Each is necessary on its own:

  • Without the relay fix, peers stuck on relay keep looping through the relay close/reopen cycle.
  • Without the ICE fix, peers that fall back to relay then re-establish ICE get pinned to a stale ICE agent on the remote side.

Both fixes are strictly defensive - they can only fire on ErrConnAlreadyExists (relay) or after a WG handshake timeout (ICE), so there's no behavior change on the happy path.

Issue ticket number and link

N/A - bug discovered in production, no pre-existing issue.

Stack

Checklist

  • Is it a bug fix
  • Is a typo/documentation fix
  • Is a feature enhancement
  • It is a refactor
  • Created tests that fail without the change (if possible)

By submitting this pull request, you confirm that you have read and agree to the terms of the Contributor License Agreement.

Documentation

Select exactly one:

  • I added/updated documentation for this change
  • Documentation is not needed for this change (explain why)

This is an internal recovery-path fix in the peer connection state machine. No public API, CLI flag, config field, or user-visible behavior changes - the fixes are strictly defensive and only fire on specific error conditions (ErrConnAlreadyExists for relay, WG handshake timeout for ICE). Nothing to document for end users or operators.

Summary by CodeRabbit

Release Notes

  • Bug Fixes
    • Improved connection timeout handling to better recover from WireGuard handshake and ICE/TURN failures by resetting session negotiation and retrying stale relay connections.
    • Enhanced peer connection stability through more robust relay connection cleanup and retry logic during connection collisions.


CLAassistant commented Apr 12, 2026

CLA assistant check
All committers have signed the CLA.


coderabbitai Bot commented Apr 12, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 656c844c-5e27-454a-b21e-9383c92984a4

📥 Commits

Reviewing files that changed from the base of the PR and between b0b9b85 and 8c4ba8b.

📒 Files selected for processing (5)
  • client/internal/peer/conn.go
  • client/internal/peer/worker_ice.go
  • client/internal/peer/worker_relay.go
  • shared/relay/client/client.go
  • shared/relay/client/manager.go
✅ Files skipped from review due to trivial changes (1)
  • client/internal/peer/conn.go
🚧 Files skipped from review as they are similar to previous changes (4)
  • client/internal/peer/worker_ice.go
  • shared/relay/client/manager.go
  • shared/relay/client/client.go
  • client/internal/peer/worker_relay.go

📝 Walkthrough

Walkthrough

On WireGuard relay handshake timeout the relay worker entry is marked stale before closing the relay connection and running relay-disconnect handling; after relay cleanup the ICE worker (if present) is forced to use a new session ID. For ICE/TURN-priority timeouts the ICE worker rotates its local session ID before closing the ICE worker. New relay client/manager APIs allow closing a relay connection by peer key; WorkerRelay tracks stale entries and will evict-and-retry on duplicate-open collisions.

Changes

Conn + ICE

  • Control / Callsite (client/internal/peer/conn.go): On WireGuard relay disconnects, call workerRelay.MarkStale() before closing the relay connection and handling the disconnect; for relay-priority timeouts, call workerICE.ResetSessionID() after relay cleanup. For ICE/TURN timeouts, call workerICE.ResetSessionID() before workerICE.Close().
  • Core Implementation (client/internal/peer/worker_ice.go): Added WorkerICE.ResetSessionID(), which locks muxAgent, generates a new sessionID via NewICESessionID(), and clears remoteSessionID on success; logs and returns on ID generation failure.

Relay worker

  • State (client/internal/peer/worker_relay.go): Add relayConnStale (atomic.Bool) to track stale relay conn entries.
  • Offer Handling (client/internal/peer/worker_relay.go): On relayClient.ErrConnAlreadyExists, either reuse the existing conn when not stale, or call CloseConnByPeerKey to evict the stale entry and retry OpenConn once; clear the stale flag on a successful retry.
  • Lifecycle / API (client/internal/peer/worker_relay.go): Set relayConnStale to true in CloseConn() and onRelayClientDisconnected(); add exported MarkStale() to mark the entry stale externally.

Relay client & manager

  • Client API (shared/relay/client/client.go): Added Client.CloseConnByPeerKey(peerKey string) to evict/close a connection by hashed peerKey: removes the mapping from conns, unsubscribes peer-state notifications, and closes the connContainer.
  • Manager Routing (shared/relay/client/manager.go): Added Manager.CloseConnByPeerKey(serverAddress, peerKey string), which prefers closing on the home client when ServerInstanceURL() matches, otherwise looks up the foreign relay client track for serverAddress and calls its CloseConnByPeerKey; silently no-ops when clients/tracks are missing.

Sequence Diagram(s)

sequenceDiagram
  participant Conn
  participant WorkerRelay
  participant RelayClient
  participant WorkerICE

  Conn->>WorkerRelay: onWGDisconnected (priority==Relay) / MarkStale()
  WorkerRelay->>RelayClient: CloseConnByPeerKey(peerKey)
  RelayClient-->>WorkerRelay: confirm closed
  WorkerRelay->>WorkerICE: ResetSessionID() (if exists)
  WorkerICE-->>WorkerRelay: new sessionID, remoteSessionID cleared
  WorkerRelay->>Conn: continue disconnect handling / retry OpenConn if needed

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

Suggested reviewers

  • pappz
  • mlsmaycon

Poem

🐰 I twitch my whiskers when sessions stale,
I nudge old relays down the trail,
I spin a new ICE name with glee,
Evict the stale, retry the spree,
Fresh hops ahead — hop, hop, hooray!

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage ⚠️ Warning: Docstring coverage is 0.00%, below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them.

✅ Passed checks (4 passed)

  • Title check ✅ Passed: The title 'fix: close stale relay connection on ErrConnAlreadyExists to recover...' clearly and specifically describes the main relay-path fix and the primary problem being addressed.
  • Description check ✅ Passed: The PR description covers the required template sections: detailed problem explanation, specific fixes with commit references, test plan validation, bug-fix checklist, and documentation justification.
  • Linked Issues check ✅ Passed: Check skipped because no linked issues were found for this pull request.
  • Out of Scope Changes check ✅ Passed: Check skipped because no linked issues were found for this pull request.




@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@shared/relay/client/manager.go`:
- Around line 150-160: The CloseConnByPeerKey method currently always calls
m.relayClient.CloseConnByPeerKey(peerKey) which misses stale connections stored
under a specific server entry; modify Manager.CloseConnByPeerKey to
accept/lookup the serverAddress (use m.relayClients[srv].relayClient when
present) and call that relayClient.CloseConnByPeerKey(peerKey) instead of always
using m.relayClient; ensure the logic still handles the nil/default
m.relayClient case and preserves the mutex usage around m.relayClients access;
also update the caller in client/internal/peer/worker_relay.go to pass the srv
value into CloseConnByPeerKey so cross-relay ErrConnAlreadyExists recovery finds
and removes the stale entry created in openConnVia/OpenConn.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: dcc48e08-b6cd-48da-ad60-56304d5f8378

📥 Commits

Reviewing files that changed from the base of the PR and between 5259e5d and d2a2ff5.

📒 Files selected for processing (5)
  • client/internal/peer/conn.go
  • client/internal/peer/worker_ice.go
  • client/internal/peer/worker_relay.go
  • shared/relay/client/client.go
  • shared/relay/client/manager.go

Comment thread shared/relay/client/manager.go
@pappz pappz self-requested a review April 13, 2026 07:53
@fpenezic fpenezic force-pushed the fix/stale-relay-conn-recovery branch from a9783e0 to 7d55ac2 Compare April 20, 2026 21:19

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@client/internal/peer/worker_relay.go`:
- Around line 66-75: The code currently treats relayClient.ErrConnAlreadyExists
as proof the existing relay is stale and unconditionally calls
w.relayManager.CloseConnByPeerKey; instead, check the existing connection's
health/state before closing or prevent concurrent OnNewOffer for the same peer.
Modify the branch handling relayClient.ErrConnAlreadyExists to first query the
existing connection via w.relayManager (e.g., a method like GetConnByPeerKey or
HasActiveConn) and only call w.relayManager.CloseConnByPeerKey when that
connection reports closed/stale (or fail a liveness check); otherwise reuse the
active connection or bail out/serialize the new offer (e.g., with a per-peer
lock) rather than tearing down a healthy relay, keeping the subsequent OpenConn
call only when a close was actually performed.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 271f46a4-0e60-4871-8a37-51c6772eb90e

📥 Commits

Reviewing files that changed from the base of the PR and between a9783e0 and 7d55ac2.

📒 Files selected for processing (5)
  • client/internal/peer/conn.go
  • client/internal/peer/worker_ice.go
  • client/internal/peer/worker_relay.go
  • shared/relay/client/client.go
  • shared/relay/client/manager.go
🚧 Files skipped from review as they are similar to previous changes (3)
  • client/internal/peer/conn.go
  • shared/relay/client/manager.go
  • shared/relay/client/client.go

Comment thread client/internal/peer/worker_relay.go
@fpenezic fpenezic marked this pull request as draft April 22, 2026 22:02
@fpenezic

Converting to draft - I discovered a regression during extended testing.

Bug: The ErrConnAlreadyExists retry loop can fire repeatedly (observed 3600+ times over ~1 hour) in multi-peer scenarios. When remote peers send offers in rapid succession (session renewal, ICE state changes, etc. - not only after a disconnect), each offer triggers close + reopen, and the remote side sees the reconnect as another state change and sends a fresh offer, producing an infinite tear-down/rebuild loop. Each iteration allocates a new wgProxy port on 127.0.0.1 while the previous one fails with use of closed network connection, generating significant background traffic (observed 9 Mbit/s constant on the affected peer).

Root cause: my earlier reasoning that OnNewOffer is only invoked after Guard detects a disconnect was incorrect. OnNewOffer also fires on normal offer/answer exchanges, so unconditionally treating ErrConnAlreadyExists as "stale by contract" is wrong - the original CodeRabbit concern about healthy connections being torn down was valid.

Plan: Add a safeguard so the close-and-retry path only runs when the existing connection is actually stale, e.g.:

  • Cooldown: skip close if the existing conn was (re)created within the last N seconds
  • Liveness check via WGWatcher handshake state
  • Per-peer serialization of OnNewOffer so rapid back-to-back offers don't overlap

Will push a follow-up commit once I have verified the fix in the same reproduction environment (PPPoE reconnect + IP rotation).

The previous fix unconditionally closed and reopened the existing relay
conn whenever OnNewOffer hit ErrConnAlreadyExists. That caused an
infinite tear-down/rebuild loop when the remote peer sent rapid
successive offers (e.g. during reconnection): close → reopen → remote
sees teardown → sends new offer → ErrConnAlreadyExists → close → loop.

Introduce a relayConnStale atomic flag that is set only when an event
indicates the existing entry is no longer backed by a live peer
session: local WG handshake timeout, relay server close, or explicit
CloseConn. OnNewOffer only tears down and reopens when the flag is
set; otherwise it reuses the existing healthy connection.

Signal sources for the flag:
- conn.onWGDisconnected (Relay path) → MarkStale before CloseConn
- WorkerRelay.CloseConn → marks stale on close
- WorkerRelay.onRelayClientDisconnected → marks stale on server close
When WireGuard handshake times out while running over ICE,
onWGDisconnected calls workerICE.Close(). Close() sets w.agent=nil
synchronously before pion's ICE library fires the asynchronous
ConnectionStateClosed event. By the time onConnectionStateChange
runs closeAgent(), the `w.agent == agent` guard fails (w.agent is
already nil) and the session ID is not rotated.

Without session ID rotation, the next ICE offer carries the same
local session ID. The remote peer in OnNewOffer compares
remoteSessionID against the incoming offer's sessionID, finds them
equal, and skips agent recreation — reusing the existing agent and
its stale candidates from the broken network state.

Observed in production: 30s reconnect loop after a NAT/conntrack
event with the same offer session ID logged ~20+ times in a row,
WG handshake never recovering despite "ICE connection succeeded"
on every cycle.

Mirror the existing Relay-path behavior: explicitly rotate the
session ID before closing the ICE worker so the remote peer always
recreates its agent after a WG timeout on ICE.
@fpenezic fpenezic force-pushed the fix/stale-relay-conn-recovery branch from b0b9b85 to c0b3e4e Compare April 25, 2026 09:32
@fpenezic fpenezic marked this pull request as ready for review April 28, 2026 06:32

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@shared/relay/client/manager.go`:
- Around line 168-172: The code reads rt.relayClient without holding the
RelayTrack lock; fix by grabbing rt.RLock() when accessing the field, copy the
pointer to a local (e.g. rc := rt.relayClient) while holding the RLock, then
RUnlock and call rc.CloseConnByPeerKey(peerKey) only if rc != nil; this protects
RelayTrack.relayClient (used/initialized under rt.Lock() in openConnVia()) while
avoiding holding the track lock during the CloseConnByPeerKey call.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 4b1aed8a-feb3-4803-b3cf-18c9a265d315

📥 Commits

Reviewing files that changed from the base of the PR and between b0b9b85 and c0b3e4e.

📒 Files selected for processing (5)
  • client/internal/peer/conn.go
  • client/internal/peer/worker_ice.go
  • client/internal/peer/worker_relay.go
  • shared/relay/client/client.go
  • shared/relay/client/manager.go

Comment thread shared/relay/client/manager.go Outdated
…eerKey

CloseConnByPeerKey was reading rt.relayClient without holding rt.RLock(),
but openConnVia() initializes that field under rt.Lock(). Race could skip
the stale-entry close while the track was still being populated, defeating
the foreign-server cleanup path.

Take rt.RLock() to copy the pointer to a local, release the track lock,
then call CloseConnByPeerKey on the copy — protects the field without
holding the track lock across a potentially blocking network call.

Addresses CodeRabbit review on PR netbirdio#5866.

Author

fpenezic commented May 5, 2026

@coderabbitai resume


coderabbitai Bot commented May 5, 2026

✅ Actions performed

Reviews resumed.
